---
title: "Extended statistical reports"
output:
  pdf_document: default
  html_notebook: default
  word_document: default
  html_document:
    df_print: paged
---







###Setup

Load libraries
```{r}

library(tidyverse)
library(knitr)
library(nlme)
library(ggbeeswarm)
library(emmeans)
library(kableExtra)
library(GGally)
library(qqplotr)
library(gridExtra)
library(RColorBrewer)
library(car)
library(effects)


```

### Load in the fish data
```{r}
master_data <- read_csv("C:/Users/kelse/Desktop/fishData_compiledForStats_v6.csv")
```

Get rid of the indexing column from Python
```{r}
master_data$X1 <- NULL

```

Make some variables that read in as numerics into factors (categorical variables)
```{r}
master_data$speed <- factor(master_data$speed)
master_data$segment <- factor(master_data$segment)
```


Just in case, let's make two smaller tables - one for bluegill, one for trout
```{r}
bluegill_data <- master_data %>%
  filter(species == "bluegill")

trout_data <- master_data %>%
  filter(species == "trout")
```



Set up some basic themes for plots
```{r}
papertheme <- theme_bw() + theme(axis.line = element_line(color="gray"), 
                                 axis.ticks = element_line(color="gray"), 
                                 panel.border = element_blank(),
                                 strip.background = element_blank(),
                                 legend.title = element_blank())
```







Let's drop left/right distinction from final data set.

```{r}
master_data_nosides <- master_data %>%
  
  select(species, individual, sequence, speed, period, freq, waveSpeed, waveLength, tbamp, Re, St,
         netCFxWholeBody, p1maxCFxWholeBody, p1minCFxWholeBody, 
         p1maxTimeCFxWholeBody, p1minTimeCFxWholeBody,
         p2maxCFxWholeBody, p2minCFxWholeBody, p2maxTimeCFxWholeBody, p2minTimeCFxWholeBody,
         netCFyWholeBody, maxCFyWholeBody, minCFyWholeBody, maxTimeCFyWholeBody, minTimeCFyWholeBody,
         segment, pressType, forceType,
         ampInSegm, ampMaxTimeInSegm, ampMinTimeInSegm, bodyAngleInSegm, 
         bodyAngleMaxTimeInSegm, bodyAngleMinTimeInSegm,
         AoAInSegm, AoAMaxTimeInSegm, AoAMinTimeInSegm,
         meanCpsInSegmSidesAvg, absMeanCpsInSegmSidesAvg,
         maxCpsInSegmSidesAvg, minCpsInSegmSidesAvg,
         maxTimeCpsInSegmSidesAvg, minTimeCpsInSegmSidesAvg, 
         meanCFxInSegmSidesAvg, netCFxInSegmSidesAvg, meanCFxByMechSidesAvg, netCFxByMechSidesAvg,
         meanCFyInSegmSidesAvg, netCFyInSegmSidesAvg,
         absMeanCFxByMechSidesAvg, absNetCFxByMechSidesAvg, absPeakCFxByMechSidesAvg, 
         peakCFxByMechSidesAvg, peakTimeCFxByMechSidesAvg, areaUnderCurveByMechSidesAvg,
         ampAtPeakByMechSidesAvg, angleAtPeakByMechSidesAvg
         ) %>%
  unique()


bluegill_data_nosides <- master_data_nosides %>%
  filter(species == 'bluegill')

trout_data_nosides <- master_data_nosides %>%
  filter(species == 'trout')

thrust_data_nosides <- master_data_nosides %>%
  filter(forceType == 'thrust')


drag_data_nosides <- master_data_nosides %>%
  filter(forceType == 'drag')


speciescomp_data_nosides <- master_data_nosides %>%
  filter(speed == "2.5")

```





# CFx models

Questions/Hypotheses we're interested in:
-How does force change in each segment across speed or species?
-How does thrust/drag compare across high to low pressure type?
-How does thrust/drag in segments compare across pressure type?


Stats analysis completed following the protocol in Zuur et al.
Zuur, A. F., Ieno, E. N., Walker, N., Saveliev, A. A. and Smith, G. M. (2009). 
  Mixed effects models and extensions in ecology with R. New York, NY: Springer New York.



### CFx (species comparison) model

Do a test for NA values - see if we have missing data
```{r rows.print=56}
sc_test_for_na <- speciescomp_data_nosides %>%
  group_by(forceType, pressType, species, segment) %>%
  summarize(n=sum(!is.na(absMeanCFxByMechSidesAvg))) %>%
  filter(n<24)

sc_test_for_na
```


Do a test for 0's - see how many we have
```{r rows.print=14}
sc_test_for_zero <- speciescomp_data_nosides %>%
  group_by(forceType, pressType, species, segment) %>%
  summarize(n=sum(absMeanCFxByMechSidesAvg==0)) %>%
  filter(n>0)

sc_test_for_zero
```


####build model

#####Find random term structure

Start with a basic linear model and a basic mixed effects model for comparisons (we believe that individual has to be a random effect).

```{r}
m2 <- gls(absMeanCFxByMechSidesAvg ~ forceType*pressType*species*segment, data = speciescomp_data_nosides)
```



```{r}
m3 <- lme(absMeanCFxByMechSidesAvg ~ forceType*pressType*species*segment,
                          random = ~1|individual, data = speciescomp_data_nosides)

```


compare models
```{r}
anova(m2,m3)
```



Do a p-value correction for testing on the boundary
```{r}
0.5*(1-pchisq(6.221792, 1))
```
This p-value is still below our significance threshold, so we accept individual as a random intercept term.

We are data-limited, so take this as our random structure (we don't have enough data to fit random slopes)




#####Check for heterogeneity



Plot residuals
```{r}
plot(m3)
```

So, we see there is heteroskedasticity.  But where does it come from?


```{r, fig.width=12, fig.height=3}

grid.arrange(
  ggplot(speciescomp_data_nosides, aes(x = forceType, y = resid(m3))) 
    + geom_boxplot() + papertheme + ggtitle("force type"),
  
  ggplot(speciescomp_data_nosides, aes(x = pressType, y = resid(m3))) 
    + geom_boxplot() + papertheme + ggtitle("pressure type"),
  
  ggplot(speciescomp_data_nosides, aes(x = species, y = resid(m3))) 
    + geom_boxplot() +papertheme + ggtitle("species"),
  
  ggplot(speciescomp_data_nosides, aes(x = segment, y = resid(m3))) 
    + geom_boxplot() + papertheme + ggtitle("segment"),
  
  
  ncol=4
)

```





Ok, so it seems like all factors have some level of unequal variances.  Segment particularly, and maybe species and pressType.



```{r, fig.width=8, fig.height=6}

grid.arrange(
  ggplot(speciescomp_data_nosides, aes(x = segment, y = resid(m3))) 
    + geom_boxplot() + papertheme + ggtitle("segment"),
  
  ggplot(speciescomp_data_nosides, aes(x = segment, y = resid(m3), color = forceType)) 
    + geom_boxplot() + papertheme + ggtitle("force type"),
  
  ggplot(speciescomp_data_nosides, aes(x = segment, y = resid(m3), color = pressType)) 
    + geom_boxplot() + papertheme + ggtitle("pressure type"),
  
  ggplot(speciescomp_data_nosides, aes(x = segment, y = resid(m3), color = species)) 
    + geom_boxplot() +papertheme + ggtitle("species"),
  
  ncol=2
)

```

The patterns vs segment didn't change much when broken up by the other factors.
Trout seems to consistently have more variance than bluegill.
Pressure type doesn't have a consistent pattern - lo and hi both have larger variances depending on the segment.
Force type is pretty equal except in the last segment.




```{r, fig.width=8, fig.height=6}

grid.arrange(
  ggplot(speciescomp_data_nosides, aes(x = species, y = resid(m3))) 
    + geom_boxplot() + papertheme + ggtitle("species"),
  
  ggplot(speciescomp_data_nosides, aes(x = species, y = resid(m3), color = forceType)) 
    + geom_boxplot() + papertheme + ggtitle("force type"),
  
  ggplot(speciescomp_data_nosides, aes(x = species, y = resid(m3), color = pressType)) 
    + geom_boxplot() + papertheme + ggtitle("pressure type"),
  
  ggplot(speciescomp_data_nosides, aes(x = species, y = resid(m3), color = segment)) 
    + geom_boxplot() +papertheme + ggtitle("segment"),
  
  ncol=2
)

```


As above, see that trout variance is larger vs segments, 
but hard to say what the difference is between species based on pressType or forceType.



#####Find variance structure

Look for what could be causing issues with residuals by marking them up with color.

```{r, fig.width=8, fig.height=6}

grid.arrange(
  ggplot(speciescomp_data_nosides, aes(x = fitted(m3), y = residuals(m3), color = segment)) 
    + geom_point(alpha = 0.4) + papertheme + ggtitle("segment"),
  
  ggplot(speciescomp_data_nosides, aes(x = fitted(m3), y = residuals(m3), color = species)) 
    + geom_point(alpha = 0.4) + papertheme + ggtitle("species"),
  
  ggplot(speciescomp_data_nosides, aes(x = fitted(m3), y = residuals(m3), color = forceType)) 
    + geom_point(alpha = 0.4) + papertheme + ggtitle("forceType"),
  
  ggplot(speciescomp_data_nosides, aes(x = fitted(m3), y = residuals(m3), color = pressType)) 
    + geom_point(alpha = 0.4) + papertheme + ggtitle("pressType"),
  
  ncol = 2
)


```

Segment and pressType are probably the issue (the really big values all from segment 1 and 7, hi pressure),
but trout does have some larger residuals than bluegill


Let's try some variance structures out.

Note:
varIdent(form =~ 1|forceType*pressType*species*segment) nor 
varIdent(form =~ 1|pressType*species*segment) nor
varIdent(form =~ 1|pressType*segment)

converge.  So, let's do what we can...

Let's do groups of 2 and then single.


```{r}

vfsc1 <- varIdent(form =~ 1|species*segment)

vfsc2 <- varIdent(form =~ 1|pressType*species)

vfsc3 <- varIdent(form =~ 1|segment)

vfsc4 <- varIdent(form =~ 1|pressType)

vfsc5 <- varIdent(form =~ 1|species)


```




Set up REML models with different variance structures.

```{r}

m3.1 <- lme(absMeanCFxByMechSidesAvg ~ forceType*pressType*species*segment,
               random = ~1|individual,
               data = speciescomp_data_nosides,
               weights = vfsc1)

m3.2 <- lme(absMeanCFxByMechSidesAvg ~ forceType*pressType*species*segment,
               random = ~1|individual,
               data = speciescomp_data_nosides,
               weights = vfsc2)

m3.3 <- lme(absMeanCFxByMechSidesAvg ~ forceType*pressType*species*segment,
               random = ~1|individual,
               data = speciescomp_data_nosides,
               weights = vfsc3)

m3.4 <- lme(absMeanCFxByMechSidesAvg ~ forceType*pressType*species*segment,
               random = ~1|individual,
               data = speciescomp_data_nosides,
               weights = vfsc4)

m3.5 <- lme(absMeanCFxByMechSidesAvg ~ forceType*pressType*species*segment,
               random = ~1|individual,
               data = speciescomp_data_nosides,
               weights = vfsc5)

```



Compare using AIC
```{r}
AIC(m3, m3.1, m3.2, m3.3, m3.4, m3.5)
```



The m3.1 model wins.





Check residuals.
```{r}
plot(m3.1)

```


Better than model without the unequal variance specification.



#####Look for optimal fixed structure

Start by finding out if highest-order interaction term is significant.
```{r rows.print=16}
anova(m3.1)
```

4-way interaction term is significant, so nothing can be dropped.



For fun, let's do a goodness-of-fit test for the model terms:
```{r}
car::Anova(m3.1)
```


Again, 4-way interaction is highly significant, so we keep all the terms.




##### Final descriptive CFx model (species comparison)
Force magnitude depends on force type, pressure type, species, and segment when the effect of individuals is controlled for, and
we allow for differing variances across segments and species.


```{r}
speciescomp_CFx_descriptive <- lme(absMeanCFxByMechSidesAvg ~ forceType*pressType*species*segment,
                          random = ~1|individual,
                          data = speciescomp_data_nosides,
                          weights = varIdent(form =~ 1|species*segment))


```


One last check - qq plot
```{r}
qqnorm(speciescomp_CFx_descriptive)
```

Ok, still not fabulous, but there may be enough data to not worry about non-normality.





```{r rows.print=56}
summary(speciescomp_CFx_descriptive)
```



```{r rows.print=16}
anova(speciescomp_CFx_descriptive)
```



#####Post-hoc tests
Do post-hoc tests so we can get a sense of what's going on in the model.

```{r}
CLD(emmeans(speciescomp_CFx_descriptive, specs = ~ forceType*pressType | segment*species), adjust = "fdr")
```



```{r}
CLD(emmeans(speciescomp_CFx_descriptive, specs = ~ species | segment*forceType*pressType), adjust = "fdr")
```


# CFx (total) models


####build model

#####Find random effects

Assume individual is a random effect, but let's demonstrate it.
```{r}

testCFxBgTr_0 <- gls(meanCFxInSegmSidesAvg ~ species + segment + species:segment, data = speciescomp_data_nosides)

testCFxBgTr_1 <- lme(meanCFxInSegmSidesAvg ~ species + segment + species:segment, 
                          random = ~1|individual, data = speciescomp_data_nosides)


```



compare models
```{r}
anova(testCFxBgTr_0,testCFxBgTr_1)
```



Do a p-value correction for testing on the boundary
```{r}
0.5*(1-pchisq(6.24547, 1))
```
This p-value is still below our significance threshold, so we accept individual as a random intercept term.

We are data-limited, so take this as our random structure (we don't have enough data to fit random slopes)


#####Check for heterogeneity



Plot residuals
```{r}
plot(testCFxBgTr_1)
```

Possibly some heterogeneity.  Hard to tell if other issues are in play too.
Let's see if there is a cause for heterogeneity.




```{r, fig.width=6, fig.height=3}

grid.arrange(

  ggplot(speciescomp_data_nosides, aes(x = species, y = resid(testCFxBgTr_1))) 
    + geom_boxplot() +papertheme + ggtitle("species"),
  
  ggplot(speciescomp_data_nosides, aes(x = segment, y = resid(testCFxBgTr_1))) 
    + geom_boxplot() + papertheme + ggtitle("segment"),
  
  
  ncol=2
)

```

Variances are unequal across segments and species


```{r, fig.width=8, fig.height=3}

grid.arrange(
  ggplot(speciescomp_data_nosides, aes(x = fitted(testCFxBgTr_1), y = residuals(testCFxBgTr_1), color = segment)) 
    + geom_point(alpha = 0.4) + papertheme + ggtitle("segment"),
  
  ggplot(speciescomp_data_nosides, aes(x = fitted(testCFxBgTr_1), y = residuals(testCFxBgTr_1), color = species)) 
    + geom_point(alpha = 0.4) + papertheme + ggtitle("species"),
  
  ncol = 2
)


```

Definitely issues due to both segments and species.


Let's see if adding a variance structure helps.



```{r}

testCFxBgTr_1a <- lme(meanCFxInSegmSidesAvg ~ species + segment + species:segment, 
                      weights = varIdent(form = ~1|species),
                      random = ~1|individual, data = speciescomp_data_nosides)

testCFxBgTr_1b <- lme(meanCFxInSegmSidesAvg ~ species + segment + species:segment, 
                      weights = varIdent(form = ~1|segment),
                      random = ~1|individual, data = speciescomp_data_nosides)

testCFxBgTr_1c <- lme(meanCFxInSegmSidesAvg ~ species + segment + species:segment, 
                      weights = varIdent(form = ~1|species*segment),
                      random = ~1|individual, data = speciescomp_data_nosides)

```




Compare models
```{r}
AIC(testCFxBgTr_1, testCFxBgTr_1a, testCFxBgTr_1b, testCFxBgTr_1c)
```

Model with the variance allowed to vary across species*segments is best by AIC score.

So, look at residuals.


```{r}
plot(testCFxBgTr_1c)
```



Better.




#####Check fixed effects structure
```{r}
anova(testCFxBgTr_1c)
```

The interaction term is highly significant - so we need all the terms in our model.





####Final bg vs tr CFx (total) model


```{r}

speciescomp_CFx_descriptive <- lme(meanCFxInSegmSidesAvg ~ species + segment + species:segment, 
                                   weights = varIdent(form = ~1|species*segment),
                                   random = ~1|individual, data = speciescomp_data_nosides)


```

```{r rows.print=14}
summary(speciescomp_CFx_descriptive)
```


```{r}
plot(speciescomp_CFx_descriptive)
```


```{r}
anova(speciescomp_CFx_descriptive)
```




```{r}
CLD(emmeans(speciescomp_CFx_descriptive, specs = ~ species | segment), adjust = "fdr")
```




# CFy models

We don't care if force is going left or right, just the magnitude.  So let's take the absolute value of CFy.

```{r}
bluegill_data_nosides$absMeanCFyInSegmSidesAvg <- abs(bluegill_data_nosides$meanCFyInSegmSidesAvg)

speciescomp_data_nosides$absMeanCFyInSegmSidesAvg <- abs(speciescomp_data_nosides$meanCFyInSegmSidesAvg)
```




####build model

#####Find random effects

Assume individual is a random effect, but let's demonstrate it.
```{r}

testCFyBgTr_0 <- gls(absMeanCFyInSegmSidesAvg ~ species + segment + species:segment, data = speciescomp_data_nosides)

testCFyBgTr_1 <- lme(absMeanCFyInSegmSidesAvg ~ species + segment + species:segment, 
                          random = ~1|individual, data = speciescomp_data_nosides)


```



compare models
```{r}
anova(testCFyBgTr_0,testCFyBgTr_1)
```



Do a p-value correction for testing on the boundary
```{r}
0.5*(1-pchisq(153.3107, 1))
```
This p-value is still below our significance threshold, so we accept individual as a random intercept term.

We are data-limited, so take this as our random structure (we don't have enough data to fit random slopes)


#####Check for heterogeneity



Plot residuals
```{r}
plot(testCFyBgTr_1)
```

Really bizarre.  For sure some heterogeneity.  Hard to tell if other issues are in play too.
Let's see if there is a cause for heterogeneity.




```{r, fig.width=6, fig.height=3}

grid.arrange(

  ggplot(speciescomp_data_nosides, aes(x = species, y = resid(testCFyBgTr_1))) 
    + geom_boxplot() +papertheme + ggtitle("species"),
  
  ggplot(speciescomp_data_nosides, aes(x = segment, y = resid(testCFyBgTr_1))) 
    + geom_boxplot() + papertheme + ggtitle("segment"),
  
  
  ncol=2
)

```

Variances are unequal across segments and species


```{r, fig.width=8, fig.height=3}

grid.arrange(
  ggplot(speciescomp_data_nosides, aes(x = fitted(testCFyBgTr_1), y = residuals(testCFyBgTr_1), color = segment)) 
    + geom_point(alpha = 0.4) + papertheme + ggtitle("segment"),
  
  ggplot(speciescomp_data_nosides, aes(x = fitted(testCFyBgTr_1), y = residuals(testCFyBgTr_1), color = species)) 
    + geom_point(alpha = 0.4) + papertheme + ggtitle("species"),
  
  ncol = 2
)


```

Definitely issues due to both segments and species.


Let's see if adding a variance structure helps.
Note that varIdent(form = ~1|species*segment) does not converge.



```{r}

testCFyBgTr_1a <- lme(absMeanCFyInSegmSidesAvg ~ species + segment + species:segment, 
                      weights = varIdent(form = ~1|species),
                      random = ~1|individual, data = speciescomp_data_nosides)

testCFyBgTr_1b <- lme(absMeanCFyInSegmSidesAvg ~ species + segment + species:segment, 
                      weights = varIdent(form = ~1|segment),
                      random = ~1|individual, data = speciescomp_data_nosides)

```




Compare models
```{r}
AIC(testCFyBgTr_1, testCFyBgTr_1a, testCFyBgTr_1b)
```

Model with the variance allowed to vary across segments is best by AIC score - HUGE improvement.

So, look at residuals.


```{r}
plot(testCFyBgTr_1b)
```



Better.  Still some patchiness - likely because our data is sparse.  Vertical spread is much improved.




#####Check fixed effects structure
```{r}
anova(testCFyBgTr_1b)
```

The interaction term is highly significant - so we need all the terms in our model.





####Final bg vs tr CFy model


```{r}

speciescomp_CFy_descriptive <- lme(absMeanCFyInSegmSidesAvg ~ species + segment + species:segment, 
                                   weights = varIdent(form = ~1|segment),
                                   random = ~1|individual, data = speciescomp_data_nosides)


```

```{r rows.print=14}
summary(speciescomp_CFy_descriptive)
```


```{r}
plot(speciescomp_CFy_descriptive)
```


```{r}
anova(speciescomp_CFy_descriptive)
```




```{r}
CLD(emmeans(speciescomp_CFy_descriptive, specs = ~ species | segment), adjust = "fdr")
```






# Efficiency models

Load efficiency data
```{r}
efficiency_master_data <- read_csv("C:/Users/kelse/Desktop/power_eff_2.5_v2.csv")
```


Get rid of the indexing column from Python
```{r}
efficiency_master_data$X1 <- NULL

```



####build model

#####Find random effects

Assume individual is a random effect, but let's demonstrate it.
```{r}

testEffBgTr_0 <- gls(efficiency ~ species, data = efficiency_master_data)

testEffBgTr_1 <- lme(efficiency ~ species, random = ~1|individual, data = efficiency_master_data)


```



compare models
```{r}
anova(testEffBgTr_0,testEffBgTr_1)
```



Do a p-value correction for testing on the boundary
```{r}
0.5*(1-pchisq(6.033351, 1))
```
This p-value is still below our significance threshold, so we accept individual as a random intercept term.

We are data-limited, so take this as our random structure (we don't have enough data to fit random slopes)


#####Check for heterogeneity



Plot residuals
```{r}
plot(testEffBgTr_1)
```

Hard to tell if there's heterogeneity here - but probably not.




```{r, fig.width=6, fig.height=3}

grid.arrange(

  ggplot(efficiency_master_data, aes(x = species, y = resid(testEffBgTr_1))) 
    + geom_boxplot() +papertheme + ggtitle("species"),
  
  
  ncol=2
)

```

Variances are equal across species



#####Check fixed effects structure
```{r}
anova(testEffBgTr_1)
```

Species is not significant.





####Final bg vs tr efficiency model


```{r}

efficiency_model <- lme(efficiency ~ species, random = ~1|individual, data = efficiency_master_data)


```

```{r rows.print=14}
summary(efficiency_model)
```


```{r}
plot(efficiency_model)
```







